residual weight
Sobolev neural network with residual weighting as a surrogate in linear and non-linear mechanics
Kilicsoy, A. O. M., Liedmann, J., Valdebenito, M. A., Barthold, F. -J., Faes, M. G. R.
Areas of computational mechanics such as uncertainty quantification and optimization usually involve repeated evaluation of numerical models that represent the behavior of engineering systems. For complex nonlinear systems, however, these models tend to be expensive to evaluate, making surrogate models quite valuable. Artificial neural networks approximate such systems very well by exploiting the information inherent in their training data. In this context, this paper investigates improving the training process by including sensitivity information, i.e., partial derivatives of the response with respect to the inputs, as outlined by Sobolev training. In computational mechanics, sensitivities can be incorporated into neural network training by expanding the loss function with additional loss terms, thereby improving training convergence and resulting in a lower generalisation error. This improvement is shown in two examples of linear and non-linear material behavior. More specifically, the Sobolev-designed loss function is expanded with residual weights that adjust the effect of each loss term on the training step. Residual weighting is the scaling assigned to the different training data, which in this case are the response and its sensitivities. These residual weights are optimized by an adaptive scheme in which varying objective functions are explored, some of which improve the accuracy and precision of the overall training convergence.
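For illustration, here is a minimal PyTorch sketch of the kind of Sobolev-style loss described in the abstract: the usual response loss is augmented with a loss on the partial derivatives with respect to the inputs, and each term is scaled by a residual weight. The architecture, the names `lambda_u` / `lambda_du`, and the toy data are assumptions made for the example; the paper's adaptive weighting scheme is not reproduced here, only fixed weights are shown.

```python
# Minimal sketch (not the authors' implementation) of a Sobolev-style loss:
# the response loss is augmented with a loss on the partial derivatives
# w.r.t. the inputs, and each term is scaled by a residual weight.
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def sobolev_loss(x, u_ref, du_ref, lambda_u=1.0, lambda_du=1.0):
    """x: inputs, u_ref: reference responses, du_ref: reference sensitivities du/dx."""
    x = x.requires_grad_(True)
    u = surrogate(x)
    # Sensitivities of the surrogate output w.r.t. its inputs.
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    loss_u = torch.mean((u - u_ref) ** 2)
    loss_du = torch.mean((du - du_ref) ** 2)
    # Residual weighting: scale each loss term before forming the training loss.
    return lambda_u * loss_u + lambda_du * loss_du

# One training step on toy data with fixed residual weights (an adaptive
# scheme would update lambda_u / lambda_du between steps).
x = torch.rand(32, 2)
u_ref = torch.sin(x).sum(dim=1, keepdim=True)
du_ref = torch.cos(x)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
opt.zero_grad()
loss = sobolev_loss(x, u_ref, du_ref)
loss.backward()
opt.step()
```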
ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers
The memory constraint of always-on devices is one of the major concerns when deploying speech processing models on such devices. While larger models trained with a sufficiently large amount of data generally perform better, making them fit in the device memory is a demanding challenge. In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure. More specifically, inspired by ResNet and the more recent LoRA work, we propose an approach named ResidualTransformer, in which each weight matrix in a Transformer layer comprises 1) a full-rank component shared with its adjacent layers, and 2) a low-rank component unique to itself. The low-rank matrices account for only a small increase in model size. In addition, we add diagonal weight matrices to improve the modeling capacity of the low-rank matrices. Experiments on our 10k-hour speech recognition and speech translation tasks show that the Transformer encoder size can be reduced by ~3X with very slight performance degradation.
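A minimal sketch of the weight composition this abstract describes (a shared full-rank part, a layer-specific low-rank part, plus a diagonal term), assuming a plain linear projection rather than a full Transformer sublayer; the class name, rank, and initialization below are illustrative choices, not the paper's.

```python
# Sketch of the shared + low-rank + diagonal weight composition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPlusLowRankLinear(nn.Module):
    """y = x @ (W_shared + A @ B + diag(d))^T + b

    W_shared is a full-rank matrix shared across a group of adjacent layers;
    A @ B is a layer-specific low-rank correction; diag(d) adds a cheap
    diagonal term intended to improve the capacity of the low-rank part.
    """
    def __init__(self, shared_weight: nn.Parameter, dim: int, rank: int = 8):
        super().__init__()
        self.shared_weight = shared_weight          # (dim, dim), shared object
        self.A = nn.Parameter(torch.zeros(dim, rank))
        self.B = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.d = nn.Parameter(torch.zeros(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        w = self.shared_weight + self.A @ self.B + torch.diag(self.d)
        return F.linear(x, w, self.bias)

# Two adjacent layers reuse the same full-rank weight; only the low-rank and
# diagonal parts (a small fraction of the parameters) are layer-specific.
dim = 256
shared = nn.Parameter(torch.randn(dim, dim) * 0.02)
layer1 = SharedPlusLowRankLinear(shared, dim)
layer2 = SharedPlusLowRankLinear(shared, dim)
y = layer2(layer1(torch.randn(4, dim)))
```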
DRGCN: Dynamic Evolving Initial Residual for Deep Graph Convolutional Networks
Zhang, Lei, Yan, Xiaodong, He, Jianshan, Li, Ruopeng, Chu, Wei
Graph convolutional networks (GCNs) have proved to be very practical for handling various graph-related tasks. Deep GCNs have attracted considerable research interest due to their potential superior performance compared with shallow ones. However, simply increasing network depth will, on the contrary, hurt performance due to the over-smoothing problem. While adding residual connections has proved effective for learning deep convolutional neural networks (deep CNNs), it is not trivial when applied to deep GCNs. Recent works proposed an initial residual mechanism that does alleviate the over-smoothing problem in deep GCNs. However, according to our study, their algorithms are quite sensitive to different datasets. In their setting, the personalization (dynamic) and correlation (evolving) of how the residual is applied are ignored. To this end, we propose a novel model called Dynamic evolving initial Residual Graph Convolutional Network (DRGCN). Firstly, we use a dynamic block for each node to adaptively fetch information from the initial representation. Secondly, we use an evolving block to model the residual evolving pattern between layers. Our experimental results show that our model effectively relieves the problem of over-smoothing in deep GCNs and outperforms state-of-the-art (SOTA) methods on various benchmark datasets. Moreover, we develop a mini-batch version of DRGCN which can be applied to large-scale data. Coupled with several fair training techniques, our model reaches new SOTA results on the large-scale ogbn-arxiv dataset of the Open Graph Benchmark (OGB). Our reproducible code is available on GitHub.
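A simplified sketch of the initial-residual idea with a per-node gate, to illustrate what "adaptively fetching information from the initial representation" means in a GCN layer; the gate used here (`alpha_gate`) is a hypothetical stand-in and does not reproduce DRGCN's dynamic and evolving blocks.

```python
# Sketch of an initial-residual graph convolution with a per-node mixing gate.
import torch
import torch.nn as nn

class InitialResidualGCNLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)
        # Per-node gate deciding how much of the initial features to fetch.
        self.alpha_gate = nn.Linear(2 * dim, 1)

    def forward(self, adj_norm, h, h0):
        """adj_norm: normalized adjacency (N, N); h: current features; h0: initial features."""
        agg = adj_norm @ h                                      # neighborhood aggregation
        alpha = torch.sigmoid(self.alpha_gate(torch.cat([agg, h0], dim=-1)))  # (N, 1)
        mixed = (1.0 - alpha) * agg + alpha * h0                # node-wise initial residual
        return torch.relu(self.weight(mixed))

# Usage on a toy graph (identity matrix as a placeholder normalized adjacency).
n, dim = 5, 16
adj_norm = torch.eye(n)
h0 = torch.randn(n, dim)
layer = InitialResidualGCNLayer(dim)
h1 = layer(adj_norm, h0, h0)
```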
Scaling up deep neural networks: a capacity allocation perspective
Capacity analysis was introduced in [2] as a way to analyze which dependencies a linear model focuses its modeling capacity on when trained on a given task. The concept was then extended in [3] to neural networks with nonlinear activations, where capacity propagation through layers was studied. When the layers are residual (or differential), and in one limiting case with extremely irregular activations (called the pseudo-random limit), it has been shown that capacity propagation through layers follows a discrete Markov equation. This discrete equation can then be approximated by a continuous Kolmogorov forward equation in the deep limit, provided a specific scaling relation holds between the network depth and the scale of its residual connections: more precisely, the residual weights must scale as the inverse square root of the number of layers. Following [1], it was then hypothesized that the success of residual networks lies in their ability to propagate capacity through a large number of layers in a non-degenerate manner. It is interesting to note that the inverse square root scaling mentioned above is the only scaling relation that leads to a non-degenerate propagation PDE in that case: larger weights would lead to shattering, while smaller ones would lead to no spatial propagation at all. In this paper, we take this idea one step further and formulate the conjecture that enforcing the right scaling relations, i.e. the ones that lead to a non-degenerate continuous limit for capacity propagation, is key to avoiding the shattering problem: we call this the neural network scaling conjecture. In the example above, this would mean that the inverse square root scaling must be enforced if one wants to use residual networks at their full power. In the second part of this paper, we use the PDE capacity propagation framework to study a number of commonly used network architectures, and determine the scaling relations that are required for a non-degenerate capacity propagation to happen in each case.
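A small sketch of the inverse-square-root scaling relation mentioned above, in which each residual branch of an L-layer network is multiplied by 1/sqrt(L); this encodes only the scaling rule itself, not the capacity propagation analysis, and the architecture and sizes are assumed for illustration.

```python
# Residual blocks whose branches are scaled by 1/sqrt(num_layers),
# i.e. the residual-weight scaling discussed in the abstract.
import math
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.scale = 1.0 / math.sqrt(num_layers)   # residual weight ~ L^{-1/2}

    def forward(self, x):
        return x + self.scale * self.branch(x)

num_layers, dim = 64, 32
net = nn.Sequential(*[ScaledResidualBlock(dim, num_layers) for _ in range(num_layers)])
y = net(torch.randn(8, dim))
```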